ABSTRACT

The corpus of descriptive terminology associated with
achievement testing has expanded considerably in recent years, in
large part due to the heightened interest in absolute and/or direct
metrics for interpreting test performance plus the development of
more rigorous strategies for specifying test content. Widely
prevalent disagreement about terminology reflects a lack of
conceptual clarification and may inhibit the development of theory
and practice. Distinctions commonly made between criterion referenced
and norm referenced tests turn out to be inaccurate, since it appears
that both content and norm referenced interpretations can apply to
scores on any type of achievement test. Rather, the particular manner
in which a given test can and should be interpreted turns out to be a
function of the mode by which test content is specified and the
function for which the test is to be used. All approaches to the
interpretation of achievement test scores are classified as either
domain referenced or norm referenced, with reference to a criterion
or standard viewed as a special case of the former. Finally, it is
argued that normative interpretations can and in many instances
should be made of scores which are referenced directly to content,
including mastery scores. (Author/BW)

The corpus of descriptive terminology associated with the characteristics
of achievement tests has expanded greatly in recent years (cf. Alkin, 1974).
Much of this expansion derives from the heightened interest in absolute
and/or direct metrics for interpreting test performance as well as in the
development of more rigorous strategies for defining test content and
specifying item characteristics. Disagreement is widely prevalent in the
field over the distinctions represented by these new, or sometimes
resurrected, terms. Ebel (1971) has even argued that criterion-referenced
measurement was tried out and abandoned early in the history of testing.

This paper will argue that recent trends in testing theory and practice
reflect a serious attempt to build new types of tests that lend themselves
to modes of interpretation referenced directly to content and/or performance.
Unfortunately, our terminology often seems to blur important distinctions
and equally important similarities between "new" and "old" approaches. A
successful delineation of such critical differences and similarities
should lead to conceptual clarification and perhaps contribute to the
development of theory and practice in the measurement of achievement.

Glaser's (1963) discussion of norm (NRT) and criterion-referenced 
(CRT) testing emphasized the distinction between interpreting test scores 
in terms of what a person can do in terms of actual performance vs. how 
well a person does as compared to other people. The distinction has been 
useful, but is now commonly applied in a way which suggests that a given test
is invariably either in one category or the other. Inherent in Glaser's
original formulation, and abundantly clear later on (Glaser and Nitko, 1971), 
is the fact that the distinction refers both to (a) the way in which test 
content is specified and (b) the kinds of interpretations that can be made 
of the resulting scores. Finally, in typical usage, the NRT vs. CRT dis- 
tinction ignores a third type of interpretation that can be made of test 
scores--one that is referenced to content or performance directly, but 
which does not incorporate the notion of a "criterion." 

All achievement tests, whether viewed by their developers as NRT or 
CRT, are in many situations interpreted both in terms of the "what" and 
"how well" question. By no means, for example, do we interpret the tradi- 
tional standardized achievement test solely in terms of norms. To say that 
"Johnny scored at the 50th percentile for his age/grade group in terms of 
national norms" would immediately bring the response, "Scored at the 50th 
percentile on what?" The test turns out, of course, to have a title, but 
if the manual is adequate there will also be something referred to as a 
content-process matrix (C/P matrix) and even an index relating sets of items 
to categories in that matrix. Though obviously subjective and non-quantitative,
this information is just as relevant to the test's interpretation as the 
numerically expressed normative score. 

The "what" aspect of the interpretation of the typical published
standardized test unfortunately incorporates great areas of subjectivity and
vagueness in (a) the way in which the content universe is specified via the
content/process matrix (Cronbach, 1969), as well as in (b) the criteria
used to develop items from that matrix (Ebel, 1962; Bormuth, 1972). In spite
of this, most users of achievement test data have been willing to take it
on faith that publishers typically develop valid measures of educationally 
important universes of content. 

Ebel (1962) demonstrated that content domains could be specified with 
greater rigor. His "content standard" score referred to the percentage of 
items answered correctly on a test made up of items sampled from such a 
domain. This formulation also anticipated contemporary developments in 
the construction and interpretation of tests of educational achievement 
in the sense of defining a numerical score referenced directly to content 
rather than indirectly to the performance of other individuals. 
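
As a rough illustration of the idea (not Ebel's own procedure), a content-standard score can be sketched as the percentage of correct responses on items sampled from an explicitly listed domain. The 200-item domain, the 20-item sample, and the simulated examinee below are all hypothetical.

```python
import random


def content_standard_score(responses):
    """Percentage of sampled domain items answered correctly."""
    if not responses:
        raise ValueError("at least one item response is required")
    return 100.0 * sum(responses) / len(responses)


# Hypothetical domain: 200 explicitly listed items, identified here by number.
domain = list(range(200))

# Draw a 20-item test from the domain; the fixed seed makes the draw repeatable.
rng = random.Random(0)
test_items = rng.sample(domain, 20)

# Hypothetical examinee who can answer every even-numbered item in the domain.
responses = [item % 2 == 0 for item in test_items]

score = content_standard_score(responses)
print(f"Content-standard score: {score:.1f}% of sampled items correct")
```

The score refers only to the domain and the examinee's own responses; no other person's performance enters the calculation.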

Just as content-referenced interpretations can (and often must) be 
applied to what are usually thought of as "norm-referenced" tests, so too 
are norm-referenced interpretations relevant to tests now being marketed 
as "criterion-referenced," "objectives-based," or "domain-referenced."
There must be some basis for believing that such a test is appropriate for
a given learner or group of learners. For example, Dahl (1974, in
preparation) observed that teachers often make major errors in leveling
objectives for their students. This is not surprising, since many educational
objectives are actually taught at differing levels of complexity at different
grade levels, and teachers are also often unaware of the specific pattern
of entry skills their students possess (Skager, 1969). Displaying sample
test items is one way of helping the teacher level objectives more
accurately. Providing appropriate normative information would be another, and
probably simpler, method from the teacher's point of view.

More important, it is unrealistic to expect that a statement of the 
"what he can do" variety will in many circumstances be seen as sufficient. 

Parents are likely to be interested in when (e.g., at what age or grade)
the typical child "masters" a given universe of content. Evaluation reports,
accountability studies, etc. cannot avoid referencing mastery
interpretations to relevant comparison groups.

It is thus argued here that the notion that one type of test is neces- 
sarily interpreted comparatively in terms of other people and the other 
directly in terms of a universe of content is inaccurate, since such
interpretations will be seen to apply to measures presently classified in both
categories. What a number of researchers and theorists seem to be search- 
ing for are ways of formalizing and objectifying content-referenced inter- 
pretations to a degree that approaches the sophistication of existing com- 
parative or normative interpretations. In other words, instead of a vague 
"content interpretation," it would be desirable, as Ebel (1962) suggested,
to have a score referenced to a content domain and expressed on a numerical 
scale. 
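
The argument that both kinds of interpretation apply to the same score can be made concrete with a small sketch. The raw score, the 40-item test length, and the ten comparison-group scores below are hypothetical, and the mid-rank percentile formula is one common convention rather than anything prescribed in this paper.

```python
def percent_of_domain(raw_score, n_items):
    """Content-referenced reading: share of sampled domain items answered correctly."""
    return 100.0 * raw_score / n_items


def percentile_rank(raw_score, norm_scores):
    """Norm-referenced reading: percentage of the comparison group scoring below
    the raw score, counting half of those who obtained the same score."""
    below = sum(1 for s in norm_scores if s < raw_score)
    equal = sum(1 for s in norm_scores if s == raw_score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)


# Hypothetical data: a 40-item test and raw scores from a comparison group.
N_ITEMS = 40
norm_group = [12, 18, 22, 25, 28, 30, 31, 33, 35, 38]
raw = 30

domain_pct = percent_of_domain(raw, N_ITEMS)   # "what he can do"
pct_rank = percentile_rank(raw, norm_group)    # "how well, compared to others"
print(f"{domain_pct:.0f}% of the domain; percentile rank {pct_rank:.0f}")
```

The same raw score yields both statements; which one is reported depends on the question being asked, not on the label attached to the test.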

The original distinction between criterion- and norm-referenced
testing obviously anticipated certain pragmatic information needs arising in
classrooms oriented toward what has come to be referred to as "mastery
learning" (cf. Bloom, 1968). Theoretically justifiable procedures for
formulating content-based decision rules relating to the management of
instruction appear to be needed. Being able to determine with some degree
of confidence whether a learner has mastered some domain of content would
presumably contribute to the orderly and efficient movement of students
through the curriculum in such classrooms.
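
One way such a decision rule might be formalized, offered only as a sketch and not as a procedure proposed in this paper, is to treat the learner's responses as a binomial sample from the domain. The cutoff proportion of 0.80 and the risk level of 0.05 are arbitrary assumptions chosen for illustration.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) when X is Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def classify_as_master(correct, n_items, cutoff=0.80, alpha=0.05):
    """Declare mastery only if a score this high would be unlikely (probability
    below alpha) for a learner whose true domain proportion equals the cutoff."""
    return prob_at_least(correct, n_items, cutoff) < alpha


# Hypothetical 20-item test sampled from the domain.
for correct in (16, 18, 20):
    p = prob_at_least(correct, 20, 0.80)
    print(correct, round(p, 3), classify_as_master(correct, 20))
```

With only 20 items, even 18 correct does not rule out a true proportion of 0.80 at this risk level, which shows how strongly the confidence attached to a mastery decision depends on test length.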



A Classification System 

A variety of distinctions can be made among contemporary achievement 
tests based on (a) the way in which content is specified and (b) the types 
of interpretations that can be made of the scores. However, an initial 
attempt to use these two characteristics to develop a comprehensive classi- 
fication system, while useful as far as making distinctions was concerned, 
tended to obscure similarities that might exist between instruments falling
in different categories. This problem was resolved by developing a third
set of categories which reflected the various functions that tests might
serve in the classroom.¹ It was then apparent that the specific way in
which a test can be interpreted is determined by a particular combination 
of intended function and mode or strategy for specifying content. The 
table reflects these relationships. 

Content Specification Mode 

Modes for specifying the content of classroom tests fall into four 
general categories, the first being the familiar content/process matrix 
from which most achievement tests in use today originated. 

The strengths and weaknesses of the C/P matrix are well known (cf. 
Cronbach, 1971). On the positive side, when properly utilized this approach 
does provide tests which have broad content coverage and are capable of
making reasonably accurate distinctions between individuals. But the test 
developer really cannot know in advance what sorts of mental processes 
examinees will actually utilize in arriving at the answer, nor be confident
that all examinees will use functionally equivalent processes. Partly as a
result, concern may have shifted away from attempting to describe cognitive
processes to a pragmatic emphasis on careful specification of the nature of
the correct response and the conditions under which the response is to be
elicited.

¹ I am indebted to Chester Harris for suggesting this approach. I am
also greatly indebted to Robert Brennan, Robert Ebel, and my colleague Richard
Shavelson for a variety of other pertinent suggestions.

The C/P matrix is also an imprecise specification strategy in terms of 
its ability to define the limits of the intended content domain. In light 
of the uses for which most contemporary tests were designed, content cover- 
age usually tends to be quite broad. Moreover, great latitude is left up to 
item writers in the determination of what the categories of the C/P matrix 
actually mean. Different item writers working independently might construct 
non-parallel tests from the same C/P matrix (cf. Cronbach, 1969). Finally, 
it may often be difficult to decide whether or not a given item belongs 
uniquely in a specific cell of a C/P matrix. 

These problems have not inhibited the development of quantitatively 
meaningful norm-referenced score interpretations. They do, however, place 
severe limits on the kinds of content-referenced interpretations that can 
be made as well as on their precision. For example, if the content domain 
is not precisely specified, it is pointless to attempt to define mastery
of that domain. 

A second means of specifying test content is provided by the theoretical 
construct. This term is used in the usual sense, in reference to
hypothesized personal characteristics, referenced to one or more psychological
theories, which, in turn, explain consistencies in the behavior of
individuals in a variety of situations. Classroom tests measuring constructs such
as intelligence, aptitudes, and perhaps cognitive styles, are familiar.
However, generalized patterns of achievement also may be formulated as
theoretical constructs. Cronbach's (1971, p. 463) definition of reading
comprehension, which either explicitly or by implication excludes vocabulary,
reading speed, general information, etc. as irrelevant to the construct is 
a useful illustration. 

Theoretical definitions of constructs, as specific as that developed 
by Cronbach, serve as guides for writing test items. But theoretical con- 
structs are not likely to provide precise specifications in this regard, 
because they refer to generalized characteristics or traits which can be 
measured in a variety of ways. Item writers working independently from
the same construct could easily produce non-parallel tests, especially in
the sense of having scores influenced by different kinds of "method" 
variance (cf. Campbell and Fiske, 1959). Clearly defined constructs focus 
on behaviors representative of the construct. The theory in which the con- 
struct is embedded deals with cognitive or affective processes, but, unlike 
the C/P matrix, the construct focusses on what can be observed and measured. 

The function of the theoretical construct is thus deliberately not
that of defining a precise content domain. Because it is embedded in theory,
it must relate to other constructs. No matter how many construct validity
studies are done, there may always be another plausible interpretation of
scores on a test measuring a given construct. Cronbach suggests,

"It might sound as if construct validity is either present
or absent, but most studies lead to an intermediate
conclusion. The reading test may truly require comprehension,
but it also makes demands on vocabulary." (p. 465)

Interpretations referenced to a precisely delimited domain of
content are not appropriate for tests derived from theoretical constructs.
This is not a liability, as will be evident in the later discussion of 
the purpose for which such instruments are likely to be used. 
The third approach to defining the content of an educational test is
now commonly referred to as objectives-based. The behavioral or
performance objective specifies (a) the conditions which will confront the
examinee and (b) the observable behavior on his part which can be taken to
constitute a correct response to those conditions. Skager (1974, p. 47,
footnote) in an earlier paper suggested that the inclusion of a third
element advocated by some--an arbitrary criterion of mastery--is probably
inappropriate. A test built to measure a given objective may be of differ- 
ent lengths depending on the purpose for which it is to be used. Further, 
there is the already alluded to tendency to confuse the concept of a cri- 
terion or standard with the separate question of how test content is to be 
specified. 

Objectives-based test materials are presently being marketed by
several test publishers in commercial delivery systems. While these systems
take different forms with different publishers, they represent a new gener- 
ation of educational assessment instrumentation. 

The real question, however, is not whether objectives-based systems 
represent something new, which they most certainly do, but rather how
far the behavioral objective can take us in the direction of providing test
scores which are susceptible to direct, content-based interpretations.
Millman (1974) has recently reminded us that behavioral objectives typically
leave much latitude up to the item writer. The specificity of objectives 
currently in use also varies widely. Sample objectives from the National 
Assessment of Educational Progress listed by Wilson (1974, p. 30) are behav- 
ioral in the sense of referring to observable actions (though in very gen- 
eral terms) without incorporating specifications about conditions. NAEP 
objectives are supplemented by "exercise prototypes" specifying response 
mode and other conditions as well as by sample exercises designed to provide
guidelines. These additional specifications, while made by committees of 
experts and extensively reviewed, are to some degree arbitrary since another 
panel of experts might have generated somewhat different specifications. 

Dahl (1971) demonstrated that judges rarely if ever made errors when 
asked to classify randomly grouped items under the objectives they were 
written to measure. It seems unlikely that the level of accuracy would 
be the same if judges were asked to classify test items in the appropriate
cells of a typical C/P matrix. The behavioral objective undoubtedly has a
great advantage in terms of clarity of specification. Also, this
particular content generation mode makes no attempt to specify the process by
which an examinee is to obtain the correct answer. But it is still
reasonable to argue that objectives defining content domains containing many
items may not define those domains uniquely. Rational analysis must be 
used to derive sets or systems of interrelated objectives from broad sub- 
ject-matter areas. Each and every objective represents a decision about 
what is important. The arbitrariness interwoven into this process is self- 
evident. 

Millman (1974) describes Popham's attempt to provide practical but 
reasonably precise guidelines for generating items from objectives. This 
author's "amplified" behavioral objectives are supplemented by statements 
describing the testing situation, the characteristics of the response
alternatives, and the criteria for scoring. One critical difference between
Popham's amplified objective and Hively's item form to be discussed next
is that the former does not include replacement stimuli. While the rules
also are looser than those formulated by Hively and his associates,
amplified objectives do appear to offer significantly more guidance to the item
writer than do ordinary behavioral objectives. The criticism that those
rules were derived arbitrarily is still relevant.

It is also evident that amplification does not rule out the
possibility that a given item or set of items will be defective from a
technical point of view. If the rules for writing items are faulty the items
will also be faulty. Thus, the amplified objective of Popham's provided by
Millman (1974, p. 34) for illustrative purposes (a) contains a specific 
determiner (correct answer inevitably a longer, less commonly used word 
than incorrect answer), (b) has the examinee putting an "X" (in effect, 
crossing out) through the correct rather than the incorrect word, and (c) 
has instructions to the examinee which may not communicate very accurately 
what is intended in the objective. It is perhaps easy to forget that the 
behavioral objective, even when amplified, does not circumvent the prob- 
lem of technically defective items. Brennan (1975) has already called 
our attention to this issue as well as explored alternative procedures of 
item analysis appropriate to tests developed from objectives. 

The last content specification mode incorporates procedures or models
proposed by various authors, all of which involve the development and
utilization of formal item generation rules. While diverse both in approach
and specificity, all have the common intent of achieving a logical,
systematic, and replicable means for generating test items representative of
a defined content domain. All in one way or another appear to devolve at
least in spirit from Ebel's (1962) concept of the "content standard" test
score. The latter was to be directly referenced to a set of tasks defined
so systematically that "...independent investigators would obtain
substantially the same scores for the same persons." (p. 16) Ebel also described
a vocabulary test developed by applying systematic rules for sampling words
from a dictionary by way of illustration, although the example itself
admittedly did not go very far in exploring the potential of the approach.
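
The replicability that such rules are meant to secure can be sketched with a simple systematic sampling rule. This is an assumption-laden illustration rather than Ebel's actual rule: the twelve-word list stands in for a dictionary, and the function name and the "every k-th entry" rule are hypothetical.

```python
def systematic_sample(entries, n_words, start=0):
    """Take every k-th entry from an ordered word list, beginning at `start`,
    where k is the list length divided (integer division) by n_words."""
    if not 0 < n_words <= len(entries):
        raise ValueError("n_words must be between 1 and the number of entries")
    step = len(entries) // n_words
    if not 0 <= start < step:
        raise ValueError("start must be at least 0 and less than the step")
    return [entries[start + i * step] for i in range(n_words)]


# Hypothetical stand-in for an alphabetized dictionary.
dictionary = ["apple", "barn", "cedar", "delta", "ember", "fjord",
              "grain", "harbor", "inlet", "jasper", "kernel", "lagoon"]

sample = systematic_sample(dictionary, 4, start=1)
print(sample)
```

Because the rule contains no judgment calls, two investigators applying it to the same word list with the same starting point obtain the same test.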

Hively and his associates (Hively et al., 1973) have developed perhaps
the best known item-generation model within the context of a curriculum
evaluation project. It is significant that a statement taken from an early
project working paper written by Hively and quoted in the 1973 monograph
reflects the goal of quantifying content interpretations quite explicitly.

"The basic notion underlying domain-referenced achievement
testing is that certain important classes of behavior can be
exhaustively defined in terms of structured sets of domains of
test items...precise definition of a domain and its subsets
makes statistical estimation (italics mine) possible." (p. 15)

Coming as it did out of the evaluation of a particular instructional 
program, the approach that Hively and his co-workers eventually developed 
involved an initial process of eliciting from developers of a mathematics 
curriculum statements about curriculum objectives. These statements ulti- 
mately were transformed into definitions of content domains which included 
(a) general descriptions of the task (sometimes in a form close to that of 
a behavioral objective), (b) statements about characteristics of the stimu- 
lus and response, (c) one or more "item form cells" defining each class of 
items in the domain (with classes grouped together because the same set of 
generation rules can be applied to each), (d) the "item form shell" which 
gives rules for constructing item variations from the one or more
"replacement sets" of stimulus elements. Each of the latter, in turn, was
referenced to a particular item form cell. Scoring specifications were also
provided.
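
The components listed above can be sketched in code. The following is a minimal illustration under stated assumptions, not a reproduction of any item form from Hively et al.: the subtraction task, the slot names `a` and `b`, the class `ItemForm`, and the helper `build_test` are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, Sequence


@dataclass
class ItemForm:
    description: str                            # general description of the task
    shell: str                                  # item form shell with named slots
    replacement_sets: Dict[str, Sequence[int]]  # slot -> allowable stimulus elements
    key: Callable[[Dict[str, int]], int]        # scoring specification

    def generate(self, rng: random.Random):
        """Fill every slot from its replacement set; return (stem, correct answer)."""
        values = {slot: rng.choice(options)
                  for slot, options in self.replacement_sets.items()}
        return self.shell.format(**values), self.key(values)


def build_test(form: ItemForm, n_items: int, seed: int):
    """Generate one test form; a different seed gives an alternate form."""
    rng = random.Random(seed)
    return [form.generate(rng) for _ in range(n_items)]


# Hypothetical item form for a simple subtraction domain.
subtraction = ItemForm(
    description="Subtract a one-digit number from a number between 10 and 19.",
    shell="{a} - {b} = ?",
    replacement_sets={"a": range(10, 20), "b": range(0, 10)},
    key=lambda v: v["a"] - v["b"],
)

form_a = build_test(subtraction, 5, seed=1)
form_b = build_test(subtraction, 5, seed=2)
for stem, answer in form_a:
    print(stem, "->", answer)
```

Once the shell and replacement sets are fixed, alternate forms differ only in which elements are drawn, so every form samples the same defined domain.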

There is something arbitrary in this process of making decisions
about the particular item form and the specific elements of the replacement
sets. This arbitrariness is at least analogous to the kind of decisions
that are made (obviously less explicitly) by the item writer working
directly from a behavioral objective without benefit of generation rules.
This is certainly recognized by Hively et al. (1973).

"Even the simplest concept or skill has so many potential
'representative' behaviors that it is impossible to specify
them all. Arbitrary limits to the population must be
imposed." (p. 15)

But one must credit Hively and his associates for developing a model 
which not only renders the results of such decisions open for all to examine
(though not the reasoning behind them) but is also genuinely capable of
objectifying the item generation process to the point where item writers 
working independently should be able to produce parallel tests. Defining 
a content domain that clearly, especially for tasks which appear to be non- 
trivial in the educational sense, is a significant achievement. 

Obviously questions arise as to the appropriateness of the Hively 
model for content domains that are considerably less structured than 
mathematics as well as in developing tests measuring functions at a high 
level of the cognitive taxonomy. (The latter criticism has also been made
of tests derived from the C/P matrix and the behavioral objective, e.g.,
Ebel, 1971.) These questions are not the primary focus here, but they
certainly bear on the extent to which the model will be used. Likewise,
the sheer amount of work and expertise that must go into generating a
significant number of item forms raises questions about cost effectiveness, 
although admittedly alternate forms of the test can be generated virtually 
automatically once the form is constructed. 

Apparently even an approach to content specification as rigorous as 
Hively's can still result in items and tests with traditional types of 
technical defects. The particular item form chosen by Hively (1973, p. 24)
to illustrate his approach may result in items which do not really assess 
(at least for some examinees) the competency identified in the general 
task description for the item form. In this particular instance it is 
possible that examinees could produce correct responses without under- 
standing or being able to generalize the concept being assessed. 

A second approach to the generation of items by systematic means 
has been advanced by Bormuth (1970), and there is a link between his and 
Hively's work. Bormuth has been especially concerned with tying the
achievement test item as closely as possible to instruction by going directly
from verbal instructional content without the intermediary of behavioral
objectives and "idiosyncratic" decisions by item writers.

"To develop a science of achievement testing, the
procedures for deriving items from the instruction must be
operationalized. One way to do this is to regard the test item as
a property of instruction and the item as being obtained by 
performing some manipulation on the instruction. Thus, an 
operational definition of a class of achievement test items 
is a series of directions which tell an item writer how to 
rearrange segments of the instruction to obtain items of that 
type." (p. 5) 

Bormuth's approach utilizes linguistic principles to derive various
item transformations from instructional content. Items are to have a
logical relationship with instruction. It should be possible to state the
"...exact manner in which the structure of the test item is related to the
structure of the relevant segment of the instruction" (p. 14). Empirical
evidence that the item is sensitive to instruction is seen as superficial,
in that it deals only with "...observations of responses" (p. 14).

There is another interesting difference in approach which contrasts 
Hively and Bormuth with Popham and the developers of most objectives-based 
assessment systems. Hively and Bormuth derive test content directly from 
instructional materials and statements. Bormuth is especially explicit, even 
militant, on this point. He strongly objects to contemporary evaluation
systems which provide items measuring behavioral objectives derived from 
abstract analyses of content domains. Bormuth maintains, for reasons that 
are not entirely clear, that teachers should not be led to shape instruc- 
tion in the direction of maximizing performance on such objectives. There 
is a difference of opinion here which would make for an interesting debate.
Many educators have maintained for some time that much instruction in the
schools goes on without clearcut objectives. Tests produced by analysis
of actual instructional content might be content valid, but fail in many
cases to meet the additional validity criterion of "educational importance"
described by Cronbach (1969). Still, Anderson's (1972) point is well taken.
If we are to measure whether or not the learner comprehends actual
instruction, then, "...a system of explicit definitions and rules to derive test
items from instructional statements..." is highly desirable (p. 149). The
general utility of approaches utilizing transformational and other gram- 
mars should continue to be examined, even if such approaches are limited 
to instruction presented via what Shoemaker (1975) refers to as the "nat- 
ural language" (p. 134). 

Bormuth's formulations are also subject to questions about efficiency
and practicality, as well as about generality of application. But he does 
suggest another path toward the precise definition of content domains which 
yields rigorous and direct (non-comparative) interpretations of performance. 

It is now appropriate to relate the four basic modes of specifying 
content to the functions for which tests are used in the classroom. 

Functions of Testing in the Classroom: Managerial

There are two major functions for which tests are used in the classroom- 
with another child who answered all of the questions correctly. Summated
over time such information may be used for evaluating either the
instruction or the learner. But this kind of evaluation has nothing to do
with deciding whether to assign extra practice in the same learning mode
or to select a different approach to instruction for the unsuccessful
child. Performance alone is sufficient. It is directly interpretable.

The use of tests for (b) and (c) above is generally well understood, 
though neither as widely or as systematically practiced as might be hoped. 
However, applying tests diagnostically to assign instructional modes opti- 
mal for given learners is at present more hope than reality. Hambleton (1974) 
in his review of three of the most widely disseminated individualized 
instructional programs concludes, "...while nearly all developers of
individualized programs describe this feature, there are few demonstrations of
significant interactions between aptitudes and instructional modes" (p. 393).
But the function itself is potentially of great importance, even if
knowledge lags behind instructional theory.

The extensive use of tests in the schools for purposes not directly
related to instruction is irrelevant to this discussion. However, testing
for guidance or clinical diagnosis is analogous to diagnostic testing in
the classroom context. Using tests for purposes of selection has its
analogue in learner evaluation.

Usage of the terms "diagnosis" and "placement" follows that of Glaser
and Nitko (1971). Other authors, e.g., Cronbach (1971), have used these
terms differently.

Aptitude tests in the past have been seen as likely candidates for
diagnostic use. Measures of cognitive styles, falling in the region be- 
tween aptitude and personality, may also be promising, and Cronbach (1975) 
has argued recently that pure personality measures may have greatest pro- 
mise of all. It is even conceivable that achievement tests could be used 
for diagnostic purposes, not in the sense of establishing entry skills for 
placement, but rather as indicators of potential transfer effects from a 
different learning domain that might interact with an instructional mode. 
Competency in the English language, for example, is a relevant basis for 
assigning children to monolingual or bilingual classrooms. 

Tests used for diagnostic purposes have to be constructed so as to 
differentiate among groups of students. Determining whether or not a test 
will be useful for diagnosis involves prediction studies, specifically the 
search for regression lines (achievement on predictor) which cross for 
different instructional modes. However, recently Cronbach (1975) has 
warned that actual relationships may be considerably more complex than 
simple first order interactions. The theoretical construct is the most 
likely content-generation mode for diagnostic tests, although the C/P 
matrix cannot be ruled out, particularly if any generalized measures of
achievement turn out to be useful in this function. 
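
The search for crossing regression lines can be sketched as follows. The aptitude and achievement values for the two instructional modes are invented for illustration, and the simple first-order comparison shown here is exactly the kind of analysis that Cronbach (1975) cautions may understate the complexity of actual relationships.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for the regression of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossing_point(line_a, line_b):
    """Predictor value at which two regression lines intersect (None if parallel)."""
    (slope_a, icpt_a), (slope_b, icpt_b) = line_a, line_b
    if slope_a == slope_b:
        return None
    return (icpt_b - icpt_a) / (slope_a - slope_b)


# Hypothetical data: predictor scores and achievement under two modes.
aptitude = [1, 2, 3, 4, 5]
achievement_mode_a = [2, 4, 6, 8, 10]
achievement_mode_b = [5, 6, 7, 8, 9]

line_a = fit_line(aptitude, achievement_mode_a)
line_b = fit_line(aptitude, achievement_mode_b)
cross = crossing_point(line_a, line_b)

# The predictor is of diagnostic interest only if the lines cross inside
# the range of scores actually observed.
useful = cross is not None and min(aptitude) <= cross <= max(aptitude)
print(cross, useful)
```

In this invented case learners scoring below the crossing point are predicted to achieve more under mode B and those above it under mode A, which is the pattern a diagnostic test must show to justify differential assignment.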

We can now turn to the formative and placement use of tests in instruc- 
tional management. Placement tests are likely to be relatively long because 
they typically cover a spectrum of instructional objectives. The three 
well-known instructional models reviewed by Hambleton (1974) (IPI, PLAN,
and Mastery Learning) all were organized around "...a curriculum defined
in terms of behavioral objectives arranged into small clusters or units
around a common topic or theme" (p. 392). Formative tests (referred to
as "diagnostic-progress" tests by Hambleton) are shorter instruments
designed to assess one or more objectives within a unit of instruction. 

The most appropriate content specification modes for these two types
of instruments would be behavioral objectives or formal item generation rules, 
since precision in the definition of the content domain is highly desirable. 
The question being asked in the classroom is whether or not the learner has 
mastered the domain in question. While formal item generation rules give 
more precision in the sense that the particular form selected for the items 
is explicit, this does not necessarily mean that a clearly stated objective 
accompanied by a sample item would not provide a definition of the domain 
adequate for the typical test user. 

Relationships between managerial functions and content specification 
modes are shown by "X's" in the upper portion of the table. Thus, the C/P 
matrix and the theoretical construct are identified as the most likely con- 
tent generation modes for tests used for diagnosing which instructional 
treatment is most appropriate for given learners. Objectives or formal 
item generation rules are seen as appropriate modes for generating test 
content in the case of placement and formative functions. There is no in- 
tent here to portray one content generation mode as superior to the others. 
Each has its uses, but there are ties between function and how test
content is likely to be specified.

Functions of Testing in the Classroom: Evaluation

There are really two different evaluative functions within the classroom. 
The first involves evaluating the learner for grading, promotion, awards,
and the like. The second involves evaluating the instruction, and in the
decade of "accountability," perhaps even the teacher. This differentiation
is made in the left hand margin of the Table. Realistically, it must be
admitted that in most classrooms evaluation tends to focus on the former. 
The same types of measures are used for both, although obviously the tests 
themselves might differ in certain characteristics depending on whether 
or not group or individual data are needed, whether matrix sampling is
appropriate, etc. 

In contrast to the situation for managerial decision-making, tests 
developed under any of the four basic content generation modes can be used 
for evaluation as indicated by the "X's" in the Table. This by no means 
implies that content generation mode is irrelevant in developing evaluation 
instruments. The particular evaluative question to be asked in a given
situation will vary depending upon the philosophy of evaluation held by
whoever will interpret those data. The nature of the question in turn
relates directly to content generation mode. For example, if the evaluation
focusses on individual learners and the question to be answered is how 
well those learners stand on the generalized objectives of the course or 
how well they are able to transfer what they have learned to new situa- 
tions, then the C/P matrix would probably be chosen because of its simpli- 
city and generality. If, on the other hand, the question is phrased in 
terms of how many of the specific objectives of the curriculum have been
mastered in a given period of time, or in terms of Brennan's (1975) notion
of "instructional time" (how long it takes the learner to master a given
objective or set of objectives), then the C/P matrix or the theoretical
construct are clearly not appropriate content specification modes. Tests
based on objectives or formal item generation rules are needed if these 
kinds of questions are to be answered. Clearly, while any of the four types 
of tests can be used for evaluation of learners or of instruction, there is 
a close relationship between the nature of the evaluative question posed in 
a given situation and the mode by which test content is most appropriately 
defined. 

Having looked at the relationships between content generation modes 
and the functions for which tests are used, it is now time to turn to the
matter of how scores on the four types of tests may be interpreted. This
is undoubtedly the area in which contemporary terminology and the underlying
conceptions which it represents lead to the greatest confusion over
distinctions between different types of tests.

Interpreting Test Scores: Domain- vs. Norm-Referenced Interpretation

There appear to be two fundamentally different ways of interpreting 
test scores. The first is labeled here as a domain-referenced (DR)
interpretation and refers directly to content or performance without regard
to comparisons among individuals. In contrast, norm-referenced (NR) inter- 
pretations derive their meaning from the relative standing of individuals 
compared to one another. This distinction is obviously not new, although 
it should be noted that the term "criterion-referenced" has not been used 
at this level of generality. 

The last two rows of the table list various types of DR and NR inter- 
pretations that have either been in use for some time, or whose use is 
conceivable given the newer approaches to the generation of test content. 
The types of score interpretations have been listed in relationship to 
the columns of the table corresponding to content generation modes.

Perhaps the most important observation to be made here is that DR 
interpretations have long been available for traditional types of tests 
whose content is derived from C/P matrices or theoretical constructs. The 
DR row under these two content specification modes lists the familiar
expectancy score which, as Cronbach (1970) suggests, refers to actual per- 
formance rather than to comparative standing. Even the predicted grade 
point averages provided by some college admissions testing programs are 
thus subject to direct, rather than comparative, interpretations, e.g., 
the probability of having a "C" average or better at the end of the fresh- 
man year at college X. Likewise, the representative item cluster score 
describes Ebel's (1962) proposition to the effect that normative test
scores should be supplemented by content-based interpretations based on 
displays of representative items typically passed by individuals obtain- 
ing various scores on the test. 

These two types of DR interpretations can also be applied to tests 
generated from theoretical constructs. In addition this latter mode is
susceptible to interpretation in terms of absolute scores of the Guttman or
Rasch variety. Tucker's (1953) proposition IV on the characteristics of
an "ideal" test minimizing the importance of reference groups makes this 
clear. 

"The scores (on such a test) indicate extent or degree of
some trait which exhibits homogeneity in the behavior of exami- 
nees" (p. 27). 

Angoff's (1971) discussion of the Guttman, Rasch, and Tucker models
does not reflect any particular interest on the part of any of these
theorists as to how test content is to be generated initially. The emphasis
is rather on whether a given set of items meets the various criteria of
scalability. But Guttman's early work was in attitude measurement, again
suggesting the theoretical construct. Absolute scales, while referring to
difficulty in the case of achievement tests, do so independently of any 
population of examinees. 

Finally, allowance should be made for the fact that diagnostic 
interpretations for the purpose of selecting instructional mode may be 
made from tests generated from these first two content specification modes. 
Here almost any kind of numerical score scale might be used, since the 
intent is to divide learners into two or more groups. 

Several new types of DR scores are pertinent to content specification
modes based on objectives or formal item generation rules. Ebel's (1962) 
content standard score, while proposed some time ago, remains the progen- 
itor of this category of interpretations. Ebel used the term "domain" 
and his content standard score can be taken as a point estimate of the 
examinee's competency with respect to that domain. Cronbach's (1970) 
content reference score refers to "...level of performance on content that 
is like the test" (p. 85). A score indicating how many words an examinee 
can type over a given period of time without making errors lends itself 
directly to incorporation into precise decision rules for training or 
selection. 

With precise specification of a content domain one can envision two 
kinds of DR scores with very useful properties. The first is a score
estimating the proportion of items in the content domain that would be 
passed by the examinee were all of the items to be administered. This is 
referred to in the Table as a domain score estimate, after Millman (1974).
The second would provide a "sign" interpretation as Harris (1974) has
characterized the much-talked about notion of mastery, and is labeled
mastery, domain-referenced. Here (at last) the much used concept of a
"criterion-referenced" score obviously applies, although the term "mastery" 
seems much more descriptive of the particular type of criterion desired. 
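
The two kinds of DR scores just described can be sketched in a few lines. The example is illustrative only: the function names, the 20 item responses, and the 80% criterion are hypothetical, and the standard error shown assumes that the items administered are a random sample from the domain, so that a simple binomial error model applies.

```python
import math


def domain_score_estimate(item_scores):
    """Proportion of sampled items passed, taken as a point estimate of the
    proportion of the whole domain the examinee would pass."""
    return sum(item_scores) / len(item_scores)


def standard_error(p_hat, n_items):
    """Binomial standard error of the estimate (assumes random item sampling)."""
    return math.sqrt(p_hat * (1 - p_hat) / n_items)


def mastery_sign(p_hat, criterion):
    """A two-valued 'sign' interpretation of the domain score estimate."""
    return "master" if p_hat >= criterion else "non-master"


# Hypothetical responses to 20 items sampled from the domain (1 = correct).
responses = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]

p_hat = domain_score_estimate(responses)
se = standard_error(p_hat, len(responses))
sign = mastery_sign(p_hat, criterion=0.80)
print(f"domain score estimate {p_hat:.2f} (s.e. {se:.3f}); classified as {sign}")
```

The same responses thus yield both a continuous estimate and a mastery decision; only the second depends on the choice of a criterion.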

Whatever one's preferences may be with respect to terminology, it 
should be duly noted that the concept is one of a criterion-referenced 
score rather than a criterion-referenced test. Cutoff points reflecting 
decision criteria could be established for all of the DR scores discussed 
up to now, including those applicable to tests generated from C/P matrices. 
Millman (1974) even discusses what he terms the "criterion-referenced 
differential assessment device" or CRDAD. This is an objectives-based 
test, but one in which items have been selected for discriminating power. 
Scores on this type of test could be referenced to a criterion or standard 
but would no longer represent the content domain defined by the objective, 
since some proportion of the items in the domain have been eliminated.

However, only the last two content generation modes appear to be
amenable to the concept of a mastery criterion. Millman's (1972) paper used
the test score as a point estimate of the domain score and applied the
binomial theorem to get at the probabilities of correct and incorrect
classification of examinees for different mastery criteria and for tests 
of different lengths. Harris (1974) illustrated the relevance of sequen- 
tial testing procedures for fixed length tests which can be regarded as 
samples of items from a defined domain. Novick and Lewis (1974) applied 
the Bayesian approach to the same problem. Lewis, Wang, and Novick (1973) 
applied Bayesian principles to estimating domain scores.
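
The binomial reasoning referred to above can be sketched as follows. This is an illustration in the spirit of that approach, not a reproduction of any author's tables: the function names, the "true" domain scores of 0.90 and 0.60, and the passing scores are all hypothetical, and the items are assumed to be a random sample from the domain.

```python
from math import comb


def prob_pass(true_domain_score, n_items, min_correct):
    """P(at least min_correct of n_items correct) under a binomial model."""
    p = true_domain_score
    return sum(
        comb(n_items, k) * p ** k * (1 - p) ** (n_items - k)
        for k in range(min_correct, n_items + 1)
    )


def misclassification_rates(n_items, min_correct, master_score, nonmaster_score):
    """Chance of failing a true master and of passing a true non-master."""
    false_negative = 1 - prob_pass(master_score, n_items, min_correct)
    false_positive = prob_pass(nonmaster_score, n_items, min_correct)
    return false_negative, false_positive


# Hypothetical comparison: an 80% passing score on tests of three lengths.
for n_items, min_correct in [(5, 4), (10, 8), (20, 16)]:
    fn, fp = misclassification_rates(n_items, min_correct, 0.90, 0.60)
    print(f"{n_items:2d} items, pass at {min_correct}: "
          f"false negative {fn:.3f}, false positive {fp:.3f}")
```

Tabulating the two error rates for several test lengths and passing scores shows how the choice of either one affects the risk of misclassifying examinees.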

Davis and Diamond (1974) point out that "...mathematically, we regard
it as impossible for an examinee who has complete knowledge of all items 
in the population to mark incorrectly any item drawn from that population" 
(p. 134). In the abstract, mastery is perfect performance, nothing less. 
Practically, of course, we know that individuals who are in fact masters
of some domain of content may answer items incorrectly because of
carelessness, distractions, and the like. These authors all view mastery in
terms of a domain of possible items. Hence the concept of a domain-referenced
mastery criterion seems useful.

Statistical methods described by these authors could be applied to 
tests generated by any of the four content-specification modes in the 
Table. However, the resulting estimates would be seriously misleading in 
the case of the first two modes. Being able to pass even 90% of the items 
in the "domain" represented by a test generated through the use of a C/P 
matrix still leaves open the possibility that there are some skills mea- 
sured by the test on which the examinee has no competence at all. Domain- 
referenced interpretations only become meaningful when there is an appro- 
priate degree of specification of that domain. 
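
A small numerical illustration may make the point concrete. The skill labels and item responses below are hypothetical; the sketch simply shows that a 90% score on a test assembled from a C/P matrix is compatible with no competence at all on one of the skills the matrix covers.

```python
# Hypothetical 20-item test drawn from a content-by-process matrix;
# each item is tagged with the skill it measures (1 = correct).
items_by_skill = {
    "addition": [1, 1, 1, 1, 1, 1],
    "subtraction": [1, 1, 1, 1, 1, 1],
    "multiplication": [1, 1, 1, 1, 1, 1],
    "division": [0, 0],
}

all_items = [score for scores in items_by_skill.values() for score in scores]
overall = sum(all_items) / len(all_items)

per_skill = {
    skill: sum(scores) / len(scores) for skill, scores in items_by_skill.items()
}

print(f"overall proportion correct: {overall:.2f}")
for skill, proportion in per_skill.items():
    print(f"  {skill}: {proportion:.2f}")
```

The overall proportion says nothing about which cells of the matrix account for the errors, which is why it cannot be read as an estimate for a well-specified domain.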

Unfortunately, the typical rule for determining mastery on tests being 
published presently is quite arbitrary in the sense of being confined to 
the test itself rather than being referenced to the domain. "Eighty percent
mastery" means getting eight out of ten items correct on the test,
period. Because such interpretations are now commonly utilized, a second
"sign" interpretation, mastery, test-referenced, has been added.

NR type interpretations are mainly familiar, especially as they apply 
to traditional types of tests. The common varieties, percentiles, age/grade
equivalents, standard score scales, are listed under the first two columns
of the table along with arbitrary scales. The latter term was used by
Angoff (1971) to describe score systems tied to a convenient reference 
group rather than a systematic sample. The Scholastic Aptitude Test, 
referring to the 1941 examinee population, is a familiar example. A number 
of other types of scores are described in Angoff's exhaustive treatment of 
the topic, but discussing them would not contribute to the distinctions 
being made here. 

Turning to the last two content-specification modes, it is apparent 
that NR, as well as DR, interpretations can be applied to tests derived 
from objectives or by means of item generation rules. Here again we must 
contradict the popular notion that there are two types of tests, one always 
interpreted in terms of norms, and the other in terms of a performance 
based criterion. Of particular interest would be (a) domain score
estimates referenced to a normative scale (domain score norm), and (b) a
mastery norm providing a comparative interpretation of mastery such as,
"objective is mastered by 50% of the population at grade level 5.3." This
kind of normative interpretation (whether in the form of a
percentile or an age/grade norm) applied to a rigorously specified content
domain would be very useful. It would reflect what schools are
accomplishing in a way that is tied both to instructional content and to relative
standing. It could also summarize the teacher's evaluation of student 
performance in a manner that is far more informative than the maligned, 
but tenacious, letter grading system. 
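
How a mastery norm and a domain score norm might be computed can be sketched briefly. Everything in the example is hypothetical: the grade levels, the domain score estimates, the 80% mastery criterion, and the function names are invented for illustration and do not describe any existing norming procedure.

```python
def mastery_rate(domain_scores, criterion):
    """Proportion of a norm group whose domain score meets the criterion."""
    masters = sum(1 for score in domain_scores if score >= criterion)
    return masters / len(domain_scores)


def domain_score_percentile(score, norm_group_scores):
    """Percent of the norm group scoring below the given domain score."""
    below = sum(1 for s in norm_group_scores if s < score)
    return 100.0 * below / len(norm_group_scores)


# Hypothetical domain score estimates on one objective, by grade level.
scores_by_grade = {
    4.3: [0.40, 0.55, 0.60, 0.85, 0.70, 0.50, 0.90, 0.65],
    5.3: [0.60, 0.85, 0.90, 0.75, 0.80, 0.95, 0.70, 0.65],
    6.3: [0.85, 0.90, 0.95, 0.80, 0.75, 1.00, 0.85, 0.90],
}

criterion = 0.80
mastery_norms = {
    grade: mastery_rate(scores, criterion)
    for grade, scores in scores_by_grade.items()
}

# Mastery norm: the lowest grade level at which half the group are masters.
grade_at_50 = min(g for g, rate in mastery_norms.items() if rate >= 0.5)
print(f"objective is mastered by 50% of the norm group at grade level {grade_at_50}")

# Domain score norm: where a domain score of 0.80 stands at grade 5.3.
print(domain_score_percentile(0.80, scores_by_grade[5.3]))
```

Both figures remain tied to the content domain, since the scores being normed are themselves domain score estimates.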

The choice of one kind of score interpretation over another obviously 
depends in part on the particular function for which the information is 
to be used. However, the conceptions of management and evaluation held in
a particular time and place also are determining factors. Modern approaches 
to the individualization of instruction such as those reviewed by Hambleton 
(1974) certainly stress the use of domain-referenced interpretations for 
placement and formative decision-making in the classroom. Indeed,
instructional philosophy has provided the primary stimulus to the development of
newer approaches to the generation of content and interpretation of test
scores. Traditional approaches to classroom management can be expected to 
continue to be associated with a preference for NR interpretations, in 
spite of the fact that this type of information does not seem to be as 
useful for these two managerial functions. 

In the case of evaluation, whether focussed on the pupil or the
instruction, both DR and NR interpretations appear to be relevant because
the two provide different and, taken alone, incomplete information. This 
observation was nicely illustrated in a recent newspaper article. It 
appears that the research branch of a large school district had published 
a report demonstrating that median percentile ranks on state mandated 
achievement tests for students in the district had remained at the same 
level or risen somewhat over the last few years. This was of course taken 
as a sign that the district was at the very least holding its own, and in 
some cases improving. The reporter, however, noted that raw score medians 
over the same period had actually gone down at most grade levels. In 
other words, students on the average were getting fewer questions correct, 
but the decline was not as precipitous as that occurring in other districts.
Comparatively speaking the district came off rather well. The reporter 
had obviously stumbled on the utility of a DR type interpretation, although 
no formal means for making such interpretations were available for the
tests in question. 
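
The arithmetic behind the newspaper story is easily reproduced. The figures below are invented for illustration and are not the district's data; the sketch assumes that a percentile rank is computed against the statewide raw score distribution for the same year, with ties counted as half.

```python
def percentile_rank(score, reference_scores):
    """Percent of the reference group below the score, counting ties as half."""
    below = sum(1 for s in reference_scores if s < score)
    ties = sum(1 for s in reference_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(reference_scores)


# Hypothetical statewide raw score distributions in two successive years;
# every statewide score falls by six points from year 1 to year 2.
state_year1 = [20, 24, 28, 30, 32, 34, 36, 38, 40, 44]
state_year2 = [14, 18, 22, 24, 26, 28, 30, 32, 34, 38]

# Hypothetical district raw score medians; the district falls by three points.
district_median_year1 = 33
district_median_year2 = 30

pr_year1 = percentile_rank(district_median_year1, state_year1)
pr_year2 = percentile_rank(district_median_year2, state_year2)

print(f"raw score median: {district_median_year1} -> {district_median_year2}")
print(f"percentile rank:  {pr_year1} -> {pr_year2}")
```

The NR figure rises while the raw score falls, because the reference group has declined faster than the district.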

It should be clear that, taken alone, both NR and DR interpretations
can be misleading, and equally so. In the case of evaluation one should
be interested in both the "What?" and "How well?" questions. Mastery of
all the goals and objectives might be achieved merely because those goals 
and standards were deliberately set low. Scoring above the 50th percen- 
tile might conceal real declines in achievement. 

Content Specification Modes and Item Selection Strategies 

In traditional test assembly both judgmental and empirical considera- 
tions enter into the determination of item quality. First, items are 
scrutinized for violations of traditional rules relating to the construc- 
tion of achievement test items, such as those summarized by Gronlund (1968). 
Items found to contain specific determiners, for example, are modified or 
discarded. 

It was suggested earlier that these familiar rules are also applicable 
to tests generated by the newer approaches to content specification. Tra- 
ditional principles of item construction should be applied with care to 
items derived by means of the newer content generation modes, since
additional empirical checks on item quality provided by difficulty and
discrimination indices may not be appropriate. Certainly eliminating some items
because they are "too easy" or "too difficult" for a given population
would invalidate a test as a measure of the domain defined by the objective
or item generation rule. 
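
The effect of screening items on difficulty can be shown with a small sketch. The response matrix, the function names, and the 0.90 cutoff for a "too easy" item are hypothetical; the point is only that a score computed on the surviving items no longer estimates performance on the domain as originally defined.

```python
def item_difficulty(responses, item):
    """Proportion of examinees passing the item (the classical p-value)."""
    return sum(row[item] for row in responses) / len(responses)


def proportion_correct(examinee, items):
    """Proportion of the listed items the examinee answered correctly."""
    return sum(examinee[i] for i in items) / len(items)


# Hypothetical responses: rows are examinees, columns are five items
# sampled from a domain defined by an objective (1 = correct).
responses = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 1, 1, 0, 1],
]

n_items = len(responses[0])
all_items = list(range(n_items))
difficulties = [item_difficulty(responses, i) for i in all_items]

# Traditional screening: drop items that nearly everyone passes.
retained = [i for i in all_items if difficulties[i] <= 0.90]

first_examinee = responses[0]
full_estimate = proportion_correct(first_examinee, all_items)
screened_score = proportion_correct(first_examinee, retained)

print(f"item difficulties: {difficulties}")
print(f"estimate on all sampled items: {full_estimate:.2f}")
print(f"score after dropping easy items: {screened_score:.2f}")
```

Here the examinee's score falls once the easy items are removed, although nothing about the examinee's competence in the domain has changed.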

Brennan (1974) has presented a variety of approaches to the analysis 
of what he terms "criterion-referenced and mastery items." But where
items are derived from item generation rules these kinds of data may
suggest that there is something wrong with the rules themselves. In other
words, the domain itself might require redefinition. How this problem 
(should it arise) would be dealt with under an approach utilizing grammati- 
cal transformation of actual subject matter is an interesting question. 

Items included in tests developed from C/P matrices and theoretical 
constructs obviously must discriminate among individuals. Technical deci- 
sions based on traditional procedures of item analysis, while appropriate 
and useful, do modify the content domain defined jointly by the content- 
specification mode and the item writer. Cox (1965) provides convincing 
empirical evidence of this fact. 

Items measuring theoretical constructs should correlate with other 
items written to measure the construct. In factor analytic studies such
items should also show a reasonable degree of independence from items
presumably measuring different constructs. In the case of constructs defining
domains susceptible to absolute interpretations, items must meet whatever 
scalability criteria are imposed. 

It appears that none of the above statistical criteria are invariably 
relevant to the study of items from content domains defined by objectives 
or item generation rules. When the items in a given domain vary in
difficulty for a given population of examinees, a high degree of homogeneity
among items derived from these two specification modes would probably be
observed. But as long as the items are congruent with the domain
specification, high inter-item correlations are not a requirement, as Cronbach
(1959) points out.

It appears, then, that determination of the quality of items based 
on objectives or derived by means of item generation rules still calls for
judgmental application of long-established principles of item construc- 
tion. 

Probably we have not yet had enough experience with tests based on 
objectives or derived from generation rules to write the final word on how
item quality is to be determined for tests of this type. It appears that 
procedures for assessing the quality of objectives-based or rule-generated 
tests and the domains from which they are derived are the next area to be 
explored in relation to the classification system proposed here. We refer, 
of course, to the analogues to traditional concepts of validity and relia- 
bility. Brennan's (1974) report as well as the ongoing investigations of 
Chester Harris and his students into approaches to item analysis which 
assess "sensitivity to instruction" are both highly relevant to this issue. 

SUMMARY 

The corpus of descriptive terminology associated with achievement 
testing has expanded considerably in recent years, in large part due to 
the heightened interest in absolute and/or direct metrics for interpreting 
test performance plus the development of more rigorous strategies for 
specifying test content. Widely prevalent disagreement about terminology 
reflects a lack of conceptual clarification and may inhibit the develop- 
ment of theory and practice. 

Distinctions commonly made between "criterion" and "norm-referenced" 
tests turn out to be inaccurate, since it appears that both content- and 
norm-referenced interpretations can apply to scores on any type of achieve- 
ment test. Rather, the particular manner in which a given test can and 
should be interpreted turns out to be a function of (a) the mode by which
test content is specified and (b) the function for which the test is to be
used. Four content specification modes were discussed in this paper in
conjunction with five functions, the latter classified as either managerial 
or evaluative. 

All approaches to the interpretation of achievement test scores are
classified as either "domain-referenced" or "norm-referenced," with
reference to a criterion or standard viewed as a special case of the former.
Finally, it is argued that normative interpretations can and in many 
instances should be made of scores which are referenced directly to
content, including mastery scores.